In any industry, data is regarded as an essential component, and data analysis and visualization serve a variety of goals. For example, businesses can use them to maximize profits, while government agencies can use them to depict demographic differences. The dataset I used for this project is the “Mental Health Problems Dataset (MHPD),” which was studied and provided by a relative from MOHP, between 2011 and 2014. This data set provides information on mental health disorders in Nepal such as depression, anxiety, epilepsy, and psychosis. The raw data set has 2625 rows and 18 columns, with fields such as year, district, zone, male, female, and age. Throughout this report, I analyze and visualize mental health concerns in Nepal by age, gender, district, and so forth. I not only visualize the mental health conditions but also try to predict results for future analysis from this data set, using machine learning techniques later on. I look for insights into the kinds of mental health disorders and the conditions of people in Nepal between the ages of 18 and 60, as well as in the geographical areas where the data was collected.
The main aim of this project is to analyse Nepali Mental Health data and interpret different types of results.
First of all, I set the working directory and installed and imported the following libraries: dplyr, ggplot2, treemap, corrplot, plotly, leaflet, caret, e1071, tm, tidytext and wordcloud. These libraries help to manipulate, analyse, visualize and predict at the different stages of the project.
#setting working directory
setwd("E:/Masters/second year/3rd sem/RProgramming/secondMilestone/Rmarkdown")
# class(MentalHealthData)
#importing libraries
library(dplyr)
library(ggplot2)
library(corrplot)
library(treemap)
library(leaflet)
library(tm)
library(tidytext)
library(wordcloud)
MentalHealthData <- read.csv("Mental_health_dataset.csv", sep=",")
dim(MentalHealthData)
## [1] 2625 18
class(MentalHealthData) #To see the data type
## [1] "data.frame"
head(MentalHealthData) # To see the initial first few observations
## S.N District_Name Zone Ecological_Belt Development_Region Year_BS Year_AD
## 1 0 Taplejung Mechi Mountain Eastern 2069 2012
## 2 1 Taplejung Mechi Mountain Eastern 2069 2012
## 3 2 Taplejung Mechi Mountain Eastern 2069 2012
## 4 3 Taplejung Mechi Mountain Eastern 2069 2012
## 5 4 Taplejung Mechi Mountain Eastern 2069 2012
## 6 5 Taplejung Mechi Mountain Eastern 2069 2012
## condition type Male Female Age Married Unmarried
## 1 Severe Dipression 26 24 19 27 23
## 2 Severe Psychosis 53 30 50 57 26
## 3 Major Anxiety (Neurosis) 24 32 21 37 19
## 4 Major Mental retardation 48 46 20 51 43
## 5 Major Conversive disorder (Hysteria) 49 29 60 45 33
## 6 Severe Alcoholism 30 20 27 49 1
## Education Employment lat
## 1 Some College, short continuing education or equivalent no 27.61859
## 2 None yes 27.61859
## 3 ##### yes 27.61859
## 4 College degree, bachelor, master yes 27.61859
## 5 ##### retired 27.61859
## 6 College degree, bachelor, master yes 27.61859
## long
## 1 87.85666
## 2 87.85666
## 3 87.85666
## 4 87.85666
## 5 87.85666
## 6 87.85666
tail(MentalHealthData) # To see the last few observations
## S.N District_Name Zone Ecological_Belt Development_Region Year_BS
## 2620 2619 Darchula Mahakali Mountain Far-Western 2077
## 2621 2620 Darchula Mahakali Mountain Far-Western 2077
## 2622 2621 Darchula Mahakali Mountain Far-Western 2077
## 2623 2622 Darchula Mahakali Mountain Far-Western 2077
## 2624 2623 Darchula Mahakali Mountain Far-Western 2077
## 2625 2624 Darchula Mahakali Mountain Far-Western 2077
## Year_AD condition type Male Female Age Married
## 2620 2020 Severe Psychosis 31 37 60 42
## 2621 2020 Minor Anxiety (Neurosis) 47 41 53 47
## 2622 2020 Severe Mental retardation 18 22 38 24
## 2623 2020 Severe Conversive disorder (Hysteria) 27 30 18 5
## 2624 2020 Normal Alcoholism 55 36 60 33
## 2625 2020 Normal Epilesy 39 30 24 48
## Unmarried Education Employment lat long
## 2620 26 Up to 9 years of school no 29.89271 80.74136
## 2621 41 College degree, bachelor, master yes 29.89271 80.74136
## 2622 16 College degree, bachelor, master no 29.89271 80.74136
## 2623 52 Up to 12 years of school no 29.89271 80.74136
## 2624 58 College degree, bachelor, master ##### 29.89271 80.74136
## 2625 21 College degree, bachelor, master yes 29.89271 80.74136
Only a few columns need to be filtered out. First, I created a vector of the columns I want to keep and subset the data to those columns. Then I mapped the existing column names to shorter ones with the help of another vector. Lastly, the row names were reset and the structure of the data frame was displayed using the str() function.
#filtering columns / using short column names...
col_selections <- c('District_Name', 'Zone', 'Ecological_Belt', 'Development_Region', 'Year_BS',
'Year_AD', 'condition', 'type', 'Male', 'Female', 'Age', 'Married',
'Unmarried', 'Education', 'Employment', 'lat', 'long')
mental_df <- MentalHealthData[,col_selections]
colnames(mental_df) <- c('DName', 'Zone', 'EBelt', 'DRegion', 'YearB',
'YearA', 'condition', 'type', 'Male', 'Female', 'Age', 'Married',
'Unmarried', 'Edu', 'Emp', 'lat', 'long')
row.names(mental_df) <- NULL
str(mental_df)
## 'data.frame': 2625 obs. of 17 variables:
## $ DName : chr "Taplejung" "Taplejung" "Taplejung" "Taplejung" ...
## $ Zone : chr "Mechi" "Mechi" "Mechi" "Mechi" ...
## $ EBelt : chr "Mountain" "Mountain" "Mountain" "Mountain" ...
## $ DRegion : chr "Eastern" "Eastern" "Eastern" "Eastern" ...
## $ YearB : int 2069 2069 2069 2069 2069 2069 2069 2069 2069 2069 ...
## $ YearA : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ condition: chr "Severe" "Severe" "Major" "Major" ...
## $ type : chr "Dipression" "Psychosis" "Anxiety (Neurosis)" "Mental retardation" ...
## $ Male : int 26 53 24 48 49 30 54 54 54 59 ...
## $ Female : int 24 30 32 46 29 20 25 49 37 26 ...
## $ Age : int 19 50 21 20 60 27 28 58 37 59 ...
## $ Married : int 27 57 37 51 45 49 24 23 22 53 ...
## $ Unmarried: int 23 26 19 43 33 1 55 80 69 32 ...
## $ Edu : chr "Some College, short continuing education or equivalent" "None" "#####" "College degree, bachelor, master" ...
## $ Emp : chr "no" "yes" "yes" "yes" ...
## $ lat : num 27.6 27.6 27.6 27.6 27.6 ...
## $ long : num 87.9 87.9 87.9 87.9 87.9 ...
Only two columns in this dataset, education and employment, contain null or uncleaned data. Their NA values were replaced with 0, and the remaining NA count per column was then checked using colSums() and is.na(). The “#####” entries visible in the head() output are ordinary strings rather than NA, so this step leaves them unchanged.
mental_df$Edu[is.na(mental_df$Edu)] <- 0
mental_df$Emp[is.na(mental_df$Emp)] <- 0
colSums(is.na(mental_df))
## DName Zone EBelt DRegion YearB YearA condition type
## 0 0 0 0 0 0 0 0
## Male Female Age Married Unmarried Edu Emp lat
## 0 0 0 0 0 0 0 0
## long
## 0
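Since `is.na()` does not recognise the `#####` strings, one option is to convert them to NA first. The following is a minimal sketch, assuming `#####` is the only placeholder used; `demo_df` is a hypothetical stand-in built from values shown in the head() output, not the full dataset.

```r
# Hypothetical stand-in for a few rows of mental_df (values copied from the head() output)
demo_df <- data.frame(
  Edu = c("None", "#####", "College degree, bachelor, master", "#####"),
  Emp = c("yes", "yes", "retired", "#####"),
  stringsAsFactors = FALSE
)

# Assumption: "#####" is the only placeholder used for missing values
placeholder <- "#####"
demo_df[demo_df == placeholder] <- NA

# Count the missing values per column after the conversion
missing_per_column <- colSums(is.na(demo_df))
print(missing_per_column)
```

The same effect can be achieved while loading the file by passing `na.strings = c("", "#####")` to `read.csv()`.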
During this step, general data analysis was carried out. Subsequently, summary statistics for the variables in the dataset were computed. Similarly, multiple univariate analyses were carried out.
Summary statistics are the first step after data transformation. At this stage, summary statistics (minimum, quartiles, median, mean, maximum) are calculated for the numerical fields, and the length and class of the character fields are shown.
summary(mental_df)
## DName Zone EBelt DRegion
## Length:2625 Length:2625 Length:2625 Length:2625
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## YearB YearA condition type
## Min. :2069 Min. :2012 Length:2625 Length:2625
## 1st Qu.:2070 1st Qu.:2013 Class :character Class :character
## Median :2071 Median :2014 Mode :character Mode :character
## Mean :2072 Mean :2015
## 3rd Qu.:2074 3rd Qu.:2017
## Max. :2077 Max. :2020
## Male Female Age Married
## Min. :18.00 Min. :18.00 Min. :18.00 Min. : 1.00
## 1st Qu.:28.00 1st Qu.:28.00 1st Qu.:28.00 1st Qu.:27.00
## Median :39.00 Median :38.00 Median :39.00 Median :38.00
## Mean :39.22 Mean :38.77 Mean :38.76 Mean :37.36
## 3rd Qu.:51.00 3rd Qu.:50.00 3rd Qu.:50.00 3rd Qu.:48.00
## Max. :60.00 Max. :60.00 Max. :60.00 Max. :60.00
## Unmarried Edu Emp lat
## Min. : -6.00 Length:2625 Length:2625 Min. :26.57
## 1st Qu.: 25.00 Class :character Class :character 1st Qu.:27.22
## Median : 40.00 Mode :character Mode :character Median :27.95
## Mean : 40.63 Mean :28.00
## 3rd Qu.: 55.00 3rd Qu.:28.69
## Max. :108.00 Max. :30.04
## long
## Min. :80.38
## 1st Qu.:82.36
## Median :84.24
## Mean :84.25
## 3rd Qu.:86.01
## Max. :87.96
table(mental_df$YearA)
##
## 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 525 525 525 140 104 281 84 333 108
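The yearly counts above are easier to compare as shares of the total. The sketch below copies the counts from the table() output into a named vector (the name `year_counts` is only illustrative) and uses `prop.table()` to convert them to proportions.

```r
# Record counts per Year_AD, copied from the table() output above
year_counts <- c("2012" = 525, "2013" = 525, "2014" = 525, "2015" = 140,
                 "2016" = 104, "2017" = 281, "2018" = 84, "2019" = 333,
                 "2020" = 108)

# Share of all records that falls in each year
year_share <- prop.table(year_counts)
print(round(year_share, 3))

# Combined share of the first three years
early_share <- sum(year_share[c("2012", "2013", "2014")])
print(early_share)
```

With the real data frame, `prop.table(table(mental_df$YearA))` would give the same shares directly.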
In order to visualize the distribution of Year, I have implemented the boxplot for Year which is illustrated below:
#boxplot for Year (BS)
boxplot(mental_df$YearB)
The box plot above illustrates the distribution of Year (BS) in the dataset. The y axis shows the year; a single box plot has no variable on the x axis. The box spans the interquartile range, 2070 to 2074, the line inside the box marks the median, 2071, and the whiskers cover the full range of 2069 to 2077, so half of the records fall in 2071 or earlier.
Furthermore, I filtered the data for Age equal to 18; the result is shown below:
mental_df%>% filter(Age == 18) %>% select(YearA, condition, type, DName, Emp)
## YearA condition type DName Emp
## 1 2012 Normal Anxiety (Neurosis) Udaypur yes
## 2 2012 Severe Anxiety (Neurosis) Dhanusha yes
## 3 2012 Minor Alcoholism Kavre no
## 4 2012 Major Mental retardation Rautahat no
## 5 2012 Severe Psychosis Gorkha no
## 6 2012 Severe Psychosis Parbat no
## 7 2012 Severe Alcoholism Gulmi no
## 8 2012 Severe Conversive disorder (Hysteria) Rolpa no
## 9 2012 Minor Psychosis Dang no
## 10 2012 Minor Mental retardation Bardiya no
## 11 2012 Major Mental retardation Surkhet no
## 12 2012 Severe Anxiety (Neurosis) Mugu no
## 13 2012 Normal Epilesy Kanchanpur no
## 14 2012 Major Alcoholism Dadeldhura yes
## 15 2013 Major Mental retardation Siraha no
## 16 2013 Minor Alcoholism Siraha no
## 17 2013 Normal Alcoholism Sindhupalchowk no
## 18 2013 Severe Mental retardation Kavre no
## 19 2013 Normal Alcoholism Bhaktapur yes
## 20 2013 Major Epilesy Bhaktapur yes
## 21 2013 Major Dipression Rautahat yes
## 22 2013 Severe Conversive disorder (Hysteria) Arghakhanchi no
## 23 2013 Major Alcoholism Bardiya no
## 24 2013 Severe Mental retardation Mugu no
## 25 2013 Severe Conversive disorder (Hysteria) Darchula no
## 26 2014 Major Conversive disorder (Hysteria) Udaypur no
## 27 2014 Major Psychosis Dolkha yes
## 28 2014 Normal Dipression Kavre no
## 29 2014 Severe Psychosis Lalitpur no
## 30 2014 Severe Mental retardation Parsa no
## 31 2014 Normal Anxiety (Neurosis) Tanahu no
## 32 2014 Minor Psychosis Rukum no
## 33 2014 Normal Epilesy Dolpa yes
## 34 2014 Severe Dipression Bajhang yes
## 35 2014 Severe Epilesy Achham no
## 36 2015 Normal Anxiety (Neurosis) Udaypur yes
## 37 2015 Severe Anxiety (Neurosis) Dhanusha yes
## 38 2016 Minor Alcoholism Kavre no
## 39 2016 Major Mental retardation Rautahat no
## 40 2017 Severe Psychosis Gorkha no
## 41 2017 Severe Psychosis Parbat no
## 42 2017 Severe Alcoholism Gulmi no
## 43 2017 Severe Conversive disorder (Hysteria) Rolpa no
## 44 2017 Minor Psychosis Dang no
## 45 2017 Minor Mental retardation Bardiya no
## 46 2017 Major Mental retardation Surkhet no
## 47 2017 Severe Anxiety (Neurosis) Mugu no
## 48 2017 Normal Epilesy Kanchanpur no
## 49 2017 Major Alcoholism Dadeldhura yes
## 50 2019 Major Mental retardation Siraha no
## 51 2019 Minor Alcoholism Siraha no
## 52 2019 Normal Alcoholism Sindhupalchowk no
## 53 2019 Severe Mental retardation Kavre no
## 54 2019 Normal Alcoholism Bhaktapur yes
## 55 2019 Major Epilesy Bhaktapur yes
## 56 2019 Major Dipression Rautahat yes
## 57 2019 Severe Conversive disorder (Hysteria) Arghakhanchi no
## 58 2019 Major Alcoholism Bardiya no
## 59 2020 Severe Mental retardation Mugu no
## 60 2020 Severe Conversive disorder (Hysteria) Darchula no
The data frame above contains 60 records for the age of 18, about 2.3% of the 2625 rows. Comparing this with the counts for ages 30 to 50 in the histogram that follows indicates whether mental issues are more common later in life. Furthermore, 45 of these 60 records list employment as “no”, so most of the 18-year-olds in this dataset who suffer from mental health issues are unemployed.
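To compare one age against wider age ranges, the ages can be grouped into bands with `cut()` and counted with `table()`. This is a minimal sketch on a hypothetical vector of ages; the band limits are an assumption chosen to match the ranges discussed in the text, and in the report the input would be `mental_df$Age`.

```r
# Hypothetical ages; in the report this would be mental_df$Age
ages <- c(18, 25, 30, 45, 60, 33)

# Band limits are an assumption; intervals are closed on the right: (17,29], (29,45], (45,60]
age_band <- cut(ages,
                breaks = c(17, 29, 45, 60),
                labels = c("18-29", "30-45", "46-60"))

# Number of records in each band
band_counts <- table(age_band)
print(band_counts)
```

Because the bands cover different numbers of years, dividing each count by the width of its band gives a fairer comparison than the raw counts.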
After the box plot of years, I also visualized the distribution of Age with a histogram.
#histogram
ggplot(mental_df, aes(Age))+geom_histogram(bins=35)
The graph above depicts the relationship between age and the frequency of mental disorder records. The frequency count is shown on the y axis, and age is shown on the x axis. The figure appears to rise, suggesting that records of mental disorders begin to increase at the age of 30 and continue to increase until the age of 60. In the twenties the problems appear to be few and far between, but they become more prevalent beyond 30 years. Similarly, I plotted the log of the number of people with mental illnesses.
Below I have plotted the log of the number of males suffering from mental issues, followed by the same plot for females.
hist(log(mental_df$Male))
The histogram shows the distribution of the log of the number of males per record. Because Male ranges from 18 to 60, the log values fall between about 2.9 and 4.1. This plot does not involve Age, so on its own it does not show which age group is more affected.
hist(log(mental_df$Female))
For females the same plot is used. The summary statistics give nearly identical means per record (39.22 for males and 38.77 for females), so this dataset does not show a clear difference between the male and female populations in Nepal.
Correlation is a common early step in data exploration. The correlation between two numeric variables can be computed using the cor() function. The code below computes the correlation between Age and the number of males per record.
#Correlation
cor(MentalHealthData$Age, MentalHealthData$Male)
## [1] 0.03196041
The result is roughly 0.03, indicating almost no linear relationship between age and the number of males. Rather than limiting ourselves to one correlation, let us depict all pairs using a corrplot diagram.
#Corrplot
mentalCor <- MentalHealthData[,c('Year_BS', 'Year_AD', 'Male', 'Female', 'Age', 'Married', 'Unmarried')]
mentalCor <- na.omit(mentalCor)
correlations <- cor(mentalCor) #correlation
corrplot::corrplot(correlations, method="circle") #heatmap
Correlation can range between -1 and 1. The graph above displays the relationship between the numeric variables: both axes list all of the variables, and each circle is colored and sized by the correlation of the corresponding pair. The closer the value is to 1 or -1, the stronger the association, while a value near 0 indicates very little linear correlation. With a positive correlation, one variable tends to increase as the other increases; with a negative correlation, one tends to decrease as the other increases. Year_BS and Year_AD describe the same year in two calendars, so they are perfectly correlated. Employment and issue type are character columns and therefore do not appear in this plot; any relationship involving them has to be examined separately.
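For a single pair of variables, `cor.test()` from base R reports the coefficient together with a p-value and a confidence interval, which helps judge whether a small value such as the Age and Male correlation above differs from zero. The sketch below uses only the first ten Age and Male values shown in the str() output, so its numbers will differ from the full-data result.

```r
# First ten values of Age and Male as shown in the str() output (not the full columns)
age  <- c(19, 50, 21, 20, 60, 27, 28, 58, 37, 59)
male <- c(26, 53, 24, 48, 49, 30, 54, 54, 54, 59)

# Pearson correlation test: estimate, p-value and 95% confidence interval
age_male_test <- cor.test(age, male)

print(age_male_test$estimate)  # same value that cor(age, male) returns
print(age_male_test$p.value)   # probability of a value this extreme if the true correlation is 0
print(age_male_test$conf.int)  # 95% confidence interval for the correlation
```

On the full dataset the call would be `cor.test(mental_df$Age, mental_df$Male)`.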
The next step after the correlation heatmap is a scatter plot, which takes two variables as input and plots one point per observation. Here I plotted year against the log of age:
#Mental issues per year
plot(mental_df$YearA, log(mental_df$Age), main="Year vs Age", xlab = "Year", ylab = "Age")
The graph above plots the log of age against year for the records from 2012 to 2020. Each year forms a vertical strip of points covering the ages from 18 to 60. No stable direction is visible because the correlation between the variables is insignificant. With ggplot we can plot four different variables at the same time: two on the axes, one as color, and one as shape.
ggplot(mental_df, aes(Age, log(Male), col=as.factor(type), shape=as.factor(condition)))+geom_point()
The diagram above shows the log of the number of males suffering from mental issues against age (18 to 60), with color indicating the issue type and shape indicating the condition. Minor, major, and severe conditions appear across the age range, along with a few normal ones. For ages 30 and above there appear to be more major conditions than normal ones.
Now let's plot mental issues by age, because there is not much yearly data for a point graph with a fitted line. For this I grouped the records by age, retrieved the frequency per age, and plotted it using the ggplot2 geom_point() and stat_smooth() functions. The same is then done per year.
#Mental issues per age
affect_by_age <- mental_df %>%
group_by(Age) %>%
summarise(count=n())
ggplot(affect_by_age, aes(x=Age, y=count))+geom_point()+stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#per year
affect_by_year <- mental_df %>%
group_by(YearB) %>%
summarise(count=n())
ggplot(affect_by_year, aes(x=YearB, y=count))+geom_point()+stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The diagrams above illustrate the frequency of mental issues in Nepal by age and by year. In the age plot, the frequency of mental conditions dips slightly between roughly 30 and 45 and rises again afterwards. From the diagram we may assume that people above the age of 30 are likely to be affected by mental issues until the age of 60.
Now let's move on to the main visualizations. Here I visualize the frequency of mental issues in Nepal by issue type. I plot age and frequency on the x and y axes respectively and fill the bars by issue type.
ggplot(data=mental_df, aes(x=Age,fill=type)) + geom_bar() + ggtitle("Mental issues by age in Nepal")+
labs(x = "Age", y = "Mental Issues/ condition Types")
The diagram above shows the mental issues faced by people aged 18 to 60. The x axis shows age, the y axis shows the frequency of records, and each color represents a different issue type. According to the diagram, mental retardation, epilepsy, depression, and conversive disorder increase with age, whereas anxiety and alcoholism rise more than the others.
Now let's visualize the frequency of mental issues in Nepal separately for each issue type. I plot age and frequency on the x and y axes respectively and split the plot into one panel per type.
mental_condition <- mental_df[which(mental_df$type !='.'), ]
ggplot(mental_condition, aes(x = Age))+ labs(title =" Mental Issues by age with type", x = "Age", y = "Types of Mental Issues") +
geom_bar(colour = "grey19", fill = "tomato3") + facet_wrap(~type, ncol = 4) + theme(axis.text.x = element_text(hjust = 1, size = 12))+
theme(strip.text = element_text(size = 16, face = "bold"))
The diagram above shows age with respect to issue type. The x axis shows age, the y axis shows the frequency of records, and each panel corresponds to one issue type. According to the diagram, alcoholism is higher between the ages of 30 and 40, anxiety is higher around 30 and 50, and psychosis is higher around the age of 40.
Now let's analyze and visualize the Male counts in the dataset by zone of Nepal. I filtered the dataset to records with more than 30 males and plotted the total number of males per zone as a treemap.
dzone <- mental_df %>% filter(Male > 30)
treemap(dzone,
index=c("Zone"),
vSize = "Male",
palette = "Reds",
title="Male per Zone",
fontsize.title = 12,
type="value",
title.legend = "Number of Male by zone"
)
The tree map above shows the number of males by zone.
Setting up the null hypothesis: there is no significant difference in mean age between the major and severe condition groups. Setting up the data for the t-test: I filtered the data to records whose condition is either severe or major and selected condition and age from the data frame.
conditionIssues <- mental_df %>%
select (condition, Age) %>%
filter(tolower(condition)=="severe" | tolower(condition)=="major")
Applying the t-test to the resulting data frame:
t.test(Age ~ condition, data = conditionIssues)
##
## Welch Two Sample t-test
##
## data: Age by condition
## t = 2.3458, df = 1320.9, p-value = 0.01913
## alternative hypothesis: true difference in means between group Major and group Severe is not equal to 0
## 95 percent confidence interval:
## 0.267441 2.999826
## sample estimates:
## mean in group Major mean in group Severe
## 40.07808 38.44444
According to the above results, the p-value is 0.019, which is below the 0.05 significance level, so the null hypothesis is unlikely to hold. I therefore rejected the null hypothesis that the difference in mean age between the major and severe groups is 0 and accepted the alternative that there is a real difference. The mean age is 40.08 in the major group and 38.44 in the severe group, and the 95% confidence interval for the difference is 0.27 to 3.00 years.
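The quantities used in this interpretation can be read directly from the object that `t.test()` returns. The sketch below uses a small hypothetical data frame (`toy_df`, not taken from the dataset) to show where the mean difference, confidence interval, and p-value are stored.

```r
# Hypothetical ages for two condition groups (not taken from the dataset)
toy_df <- data.frame(
  condition = rep(c("Major", "Severe"), each = 5),
  Age = c(30, 35, 40, 45, 50, 28, 33, 38, 43, 48)
)

# Welch two-sample t-test, the default for t.test() with a formula
toy_test <- t.test(Age ~ condition, data = toy_df)

# Pieces needed for the written interpretation
mean_diff   <- unname(toy_test$estimate[1] - toy_test$estimate[2])  # Major minus Severe
conf_limits <- toy_test$conf.int                                    # 95% CI for the difference
reject_null <- toy_test$p.value < 0.05                              # decision at the 5% level

print(mean_diff)
print(conf_limits)
print(reject_null)
```

In this toy example the interval contains 0, so the null hypothesis is not rejected; in the report's test the interval of 0.27 to 3.00 excludes 0, which matches the rejection.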
For the next, dynamic visualization, I implemented a map plot of mental issues in different regions. First I filtered out the records with missing latitude or longitude. Then I added a map tile layer with markers for Nepal, provided a popup containing both the BS and AD year, ecological belt, zone, illness type, illness condition, and the male and female counts with the corresponding age, and finally set clusterOptions to markerClusterOptions().
mental_dfll <- mental_df %>%
filter(!is.na(lat) & !is.na(long))
clusterNepalMap <- leaflet() %>%
addTiles('https://cartocdn_{s}.global.ssl.fastly.net/base-midnight/{z}/{x}/{y}.png',
attribution='© <a href="http://www.openstreetmap.org/copyright">OpenStreetMap
</a> © <a href="http://cartodb.com/attributions">CartoDB</a>')
clusterNepalMap %>% addMarkers(data=mental_dfll,popup=paste0("<div class=\"table-title\">
<h3>Mental Issues of Nepal Details</h3>
</div>
<table class=\"table-fill\">
<thead>
<tr>
<th class=\"text-left\">MetaData Name</th>
<th class=\"text-left\">MetaData Value</th>
</tr>
</thead>
<tbody class=\"table-hover\">
<tr>
<td class=\"text-left\">Event Date</td>
<td class=\"text-left\">",mental_dfll$YearA,'/',mental_dfll$YearB,"</td>
</tr>
<tr>
<td class=\"text-left\">Ecological Belt</td>
<td class=\"text-left\">",mental_dfll$EBelt,"</td>
</tr>
<tr>
<td class=\"text-left\">Zone</td>
<td class=\"text-left\">",mental_dfll$Zone,"</td>
</tr>
<tr>
<td class=\"text-left\">Illness Type</td>
<td class=\"text-left\">",mental_dfll$type,"</td>
</tr>
<tr>
<td class=\"text-left\">Condition</td>
<td class=\"text-left\">",mental_dfll$condition,"</td>
</tr>
<tr>
<td class=\"text-left\">Male</td>
<td class=\"text-left\">",mental_dfll$Male,"</td>
</tr>
<tr>
<td class=\"text-left\">Female</td>
<td class=\"text-left\">",mental_dfll$Female,"</td>
</tr>
<tr>
<td class=\"text-left\">Age</td>
<td class=\"text-left\">",mental_dfll$Age,"</td>
</tr>
</tbody>
</table>"), clusterOptions = markerClusterOptions())
## Assuming "long" and "lat" are longitude and latitude, respectively
The accompanying map shows mental health issues, ranging from normal to severe, across Nepal’s districts. The map is interactive in HTML (made by rmarkdown) and can be zoomed in for more detailed analysis. According to the map, most of the issues are from Bheri, Rapti, and Bagmati.
Linear regression models how the output changes as a result of a change in the input. A point graph with geom_smooth(method = "lm") is used here to check the linear relationship between the variables.
issues <- mental_df %>%
group_by(YearB) %>%
summarise(count=n())
ggplot(issues, aes(x=YearB, y=count))+geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The fitted line in this diagram shows the linear relationship between year and frequency of mental issues, with year as the input variable and frequency as the output variable. A linear model can be fitted, as illustrated above. The majority of the values are near the line, while some are spread out as outliers.
cor(issues$YearB, issues$count)
## [1] -0.7032947
The correlation between year and frequency of mental issues is -0.7032947, indicating a negative relationship between the variables: as one increases, the other tends to decrease.
In simple linear regression, there is only one input variable and one corresponding output. Here the year serves as the input variable and the frequency of mental issues as the single output.
model_linear_regression = lm(issues$count ~ issues$YearB)
model_linear_regression
##
## Call:
## lm(formula = issues$count ~ issues$YearB)
##
## Coefficients:
## (Intercept) issues$YearB
## 103423.42 -49.75
#summary(model_linear_regression)
As per the result, the intercept is 103423.42 and the slope is -49.75 for the year variable. The fitted formula can therefore be written as no_of_mental_issues = -49.75 * year + 103423.42, where year is the Bikram Sambat year (YearB).
Two things follow from the formula. 1. For each unit increase in year, the number of mental issues decreases by about 49.75. 2. For a future year such as 2030 AD, which corresponds to 2087 BS in this dataset (BS = AD + 57), the predicted value is -49.75 * 2087 + 103423.42 ≈ -405. A negative count is not meaningful, so the linear trend should not be extrapolated that far beyond the observed years.
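The prediction can also be computed with `predict()` rather than by hand. The sketch below rebuilds the yearly counts from the earlier table() output (the name `issues_demo` is illustrative) and fits the model with a `data =` argument, which is what allows `newdata` to be used. The AD to BS offset of 57 is an assumption read off the Year_AD and Year_BS columns.

```r
# Yearly record counts from the table() output, keyed by Bikram Sambat year
issues_demo <- data.frame(
  YearB = 2069:2077,
  count = c(525, 525, 525, 140, 104, 281, 84, 333, 108)
)

# Fitting with data = ... (instead of issues$count ~ issues$YearB)
# lets predict() accept new years through newdata
fit <- lm(count ~ YearB, data = issues_demo)
print(coef(fit))

# Assumption: 2030 AD corresponds to 2087 BS (offset of 57 seen in the data)
pred_2087 <- predict(fit, newdata = data.frame(YearB = 2030 + 57))
print(unname(pred_2087))
```

The coefficients match the intercept and slope reported above, and the negative predicted count illustrates why a straight line fitted to nine years should not be extended a decade ahead.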
In multiple linear regression there are several input variables and one continuous output variable. Here I took Age as the output variable, with Male, Female, Married, Unmarried, and YearB as input variables. First I converted the Male, Female, Married, and Unmarried fields to factors. After that I formulated the linear regression formula.
mental_issues_df_factored <- mental_df %>%
select(Age, Male, Female, Married, Unmarried, YearB)
mental_issues_df_factored$Male <- as.factor(mental_issues_df_factored$Male)
mental_issues_df_factored$Female <- as.factor(mental_issues_df_factored$Female)
mental_issues_df_factored$Married <- as.factor(mental_issues_df_factored$Married)
mental_issues_df_factored$Unmarried <- as.factor(mental_issues_df_factored$Unmarried)
mental_issues_df_factored <- mental_issues_df_factored[complete.cases(mental_issues_df_factored),]
reg_model <- lm(Age ~ Male + Female + Married + Unmarried + YearB,
data = mental_issues_df_factored)
anova(reg_model)
## Analysis of Variance Table
##
## Response: Age
## Df Sum Sq Mean Sq F value Pr(>F)
## Male 42 8201 195.25 1.3687 0.058179 .
## Female 42 10286 244.91 1.7169 0.002939 **
## Married 48 34503 718.81 5.0389 < 2.2e-16 ***
## Unmarried 100 16802 168.02 1.1778 0.113836
## YearB 1 14 13.98 0.0980 0.754254
## Residuals 2391 341077 142.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#summary(reg_model)
The Analysis of Variance table is shown above. It is used when the input variables are categorical and the output variable is continuous. It shows that Married (p < 2.2e-16) and Female (p = 0.0029) are significant at the 0.05 level and Male is borderline (p = 0.058), while Unmarried and YearB are not as important for prediction.
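In the table above, Male has 42 degrees of freedom because `as.factor()` turns each of its 43 distinct values into a separate level. The sketch below, on hypothetical data (`df_demo`), contrasts a predictor entered as a number with the same predictor entered as a factor.

```r
# Hypothetical data: a count-like predictor with five distinct values
df_demo <- data.frame(
  x = rep(1:5, each = 4),
  y = c(3, 5, 4, 6, 7, 6, 8, 9, 10, 9, 11, 12, 13, 12, 14, 15, 16, 18, 17, 19)
)

fit_numeric <- lm(y ~ x, data = df_demo)          # one slope: 1 degree of freedom
fit_factor  <- lm(y ~ factor(x), data = df_demo)  # one level per value: 4 degrees of freedom

df_numeric <- anova(fit_numeric)$Df[1]
df_factor  <- anova(fit_factor)$Df[1]
print(c(numeric = df_numeric, factor = df_factor))
```

Keeping count variables numeric gives one coefficient per variable, which is usually easier to interpret; the factor form is a deliberate choice when a separate effect per value is wanted.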
Predicting the age for the first 30 observations:
predict(reg_model, mental_issues_df_factored[1:30, c('Age', 'Male', 'Female', 'Married', 'Unmarried', 'YearB')])
## 1 2 3 4 5 6 7 8
## 34.93144 47.68807 34.96732 32.47983 43.31426 35.78129 37.40651 49.75066
## 9 10 11 12 13 14 15 16
## 38.46377 42.57061 43.56266 43.61216 41.82052 36.73292 36.55503 34.72408
## 17 18 19 20 21 22 23 24
## 33.19230 42.68709 44.35365 43.38397 46.07905 38.84516 38.32597 42.75391
## 25 26 27 28 29 30
## 39.24865 33.83515 39.62993 39.70167 36.84153 41.22578
predict(reg_model, mental_issues_df_factored[1:30, c('Age', 'Male', 'Female', 'Married', 'Unmarried', 'YearB')], interval = "confidence")
## fit lwr upr
## 1 34.93144 28.42093 41.44194
## 2 47.68807 40.81776 54.55838
## 3 34.96732 28.54748 41.38716
## 4 32.47983 26.14726 38.81240
## 5 43.31426 37.54142 49.08710
## 6 35.78129 29.12507 42.43751
## 7 37.40651 31.65798 43.15504
## 8 49.75066 39.81200 59.68932
## 9 38.46377 31.76361 45.16394
## 10 42.57061 35.84652 49.29471
## 11 43.56266 37.43974 49.68559
## 12 43.61216 37.43745 49.78686
## 13 41.82052 35.93780 47.70323
## 14 36.73292 30.43228 43.03355
## 15 36.55503 28.79823 44.31182
## 16 34.72408 26.36553 43.08263
## 17 33.19230 26.18772 40.19687
## 18 42.68709 34.57457 50.79961
## 19 44.35365 38.04304 50.66426
## 20 43.38397 37.63780 49.13014
## 21 46.07905 39.61850 52.53961
## 22 38.84516 32.27622 45.41410
## 23 38.32597 31.70099 44.95094
## 24 42.75391 36.24128 49.26654
## 25 39.24865 33.46949 45.02780
## 26 33.83515 27.61785 40.05246
## 27 39.62993 31.94815 47.31171
## 28 39.70167 33.55319 45.85015
## 29 36.84153 31.20081 42.48226
## 30 41.22578 34.61609 47.83547
#reg_model$residuals
Finally, we can see from the preceding example how linear regression can be used to predict a continuous variable.
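The `interval = "confidence"` option used above describes the uncertainty of the mean response; `interval = "prediction"` describes the uncertainty of a single new observation and is wider. A minimal sketch of the difference, using the `cars` dataset that ships with base R purely as a stand-in:

```r
# `cars` ships with base R and is used here only as a stand-in dataset
fit_cars <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = c(10, 20))

# Interval for the mean response at each speed
conf_int <- predict(fit_cars, newdata = new_obs, interval = "confidence")

# Interval for a single new observation at each speed
pred_int <- predict(fit_cars, newdata = new_obs, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both calls return the same fitted values; only the lower and upper limits differ.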
The type field in the dataset is where we can perform text analysis. In this case, half of the entries are sampled at random. The “type” variable is converted to lowercase text; numerals, punctuation, and stop words are removed; and a term document matrix is created. Frequently used descriptive keywords listed in specificWords are also passed to removeWords. The terms are then sorted by frequency in decreasing order. The corpus was cleaned using the tm library and its tm_map() function, and finally the terms are plotted as a wordcloud using the wordcloud library.
#Text analysis of mental issue types
mental_df$type <- tolower(mental_df$type)
known_df <- mental_df %>%
filter(type != "Alcoholism")
er_df <-known_df %>%
filter(type != "")
text <- sample(er_df$type, nrow(er_df)/2)
specificWords <- c("Dipression", "Conversive disorder (Hysteria)", "Epilesy", "Mental retardation", "Epilesy", "Psychosis", "Anxiety (Neurosis)", "Dipression", "Conversive disorder (Hysteria)", "Epilesy", "Mental retardation", "Epilesy")
text<-sapply(text, function(x) gsub("\n"," ",x))
myCorpus<-VCorpus(VectorSource(text))
myCorpusClean <- myCorpus %>%
tm_map(content_transformer(removeNumbers)) %>%
tm_map(content_transformer(removePunctuation)) %>%
tm_map(content_transformer(removeWords),tidytext::stop_words$word) %>%
tm_map(content_transformer(removeWords),specificWords)
myDtm = TermDocumentMatrix(myCorpusClean,
control = list(minWordLength = 3))
freqTerms <- findFreqTerms(myDtm, lowfreq=1)
m <- as.matrix(myDtm)
v <- sort(rowSums(m), decreasing=TRUE)
myNames <- names(v)
d <- data.frame(word=myNames, freq=v)
wctop <-wordcloud(d$word, d$freq, min.freq=5, colors=brewer.pal(9,"Set1"))
I projected the mental issue text onto a wordcloud to graphically depict word frequency. The wordcloud visualizes the different forms of mental illness recorded across the regions of Nepal. Terms such as mental retardation, psychosis, and depression stand out as the most frequent words associated with the issue type.
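The word frequencies behind a wordcloud can be checked with base R alone. The sketch below uses a few labels in the form they take in the type column after lowercasing (the short `types` vector is illustrative) and shows that a multi-word label such as "mental retardation" is counted as two separate words.

```r
# A few labels as they appear in the type column after lowercasing
types <- c("mental retardation", "psychosis", "mental retardation",
           "anxiety (neurosis)")

# Strip punctuation, then split each label on whitespace
clean_types <- gsub("[[:punct:]]", "", types)
words <- unlist(strsplit(clean_types, "\\s+"))

# Word frequencies, most frequent first
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)
```

If whole labels rather than single words are of interest, `table(types)` counts each label as one unit.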
Through this project, I learned R programming and libraries such as dplyr, ggplot2, treemap, corrplot, plotly, and others. First, I got to know the dataset of my choosing, the “Mental Health Dataset.” After loading the dataset, I gained an understanding of data frames. Then I did transformation and cleaning, and I discovered that a large dataset with 2625 rows and 18 columns can be reduced to only the necessary fields. Manipulating and displaying the dataset also taught me how to use R programming to analyze large datasets. I applied what I found in the dataset, converting some variables to other types, such as factors, where needed.
I developed a grasp of variable correlation, how variables can be connected to one another, and how correlation can be expressed graphically using corrplot. I also acquired data visualization skills by representing various variables with histograms, box plots, scatter plots, and so on. I learned about the t-test as a statistical testing approach, as well as machine learning techniques such as linear regression and text analysis. This helped in making predictions and analyzing the dataset’s text. I also learned about the significance of data combining, data replacement, data removal, and data analysis and visualization in general.
Throughout this process I had numerous difficulties, but they were all overcome, which not only enhanced my confidence but also helped strengthen my problem-solving skills.